Disclosure Risk Evaluation for Fully Synthetic Categorical Data
نویسندگان
چکیده
We present an approach for evaluating disclosure risks for fully synthetic categorical data. The basic idea is to compute probability distributions of unknown confidential data values given the synthetic data and assumptions about intruder knowledge. We use a “worst-case” scenario of an intruder knowing all but one of the records in the confidential data. To create the synthetic data, we use a Dirichlet process mixture of products of multinomial distributions, which is a Bayesian version of a latent class model. In addition to generating synthetic data with high utility, the likelihood function admits simple and convenient approximations to the disclosure risk probabilities via importance sampling. We illustrate the disclosure risk computations by synthesizing a subset of data from the American Community Survey.
منابع مشابه
Bayesian Non-parametric Generation of Fully Synthetic Multivariate Categorical Data in the Presence of Structural Zeros
Statistical agencies are increasingly adopting synthetic data methods for disseminating microdata without compromising the privacy of respondents. Crucial to the implementation of these approaches are flexible models, able of capturing the nuances of the multivariate structure present in the original data. In the case of multivariate categorical data, preserving this multivariate structure also...
متن کاملLocal synthesis for disclosure limitation that satisfies probabilistic k-anonymity criterion
Before releasing databases which contain sensitive information about individuals, data publishers must apply Statistical Disclosure Limitation (SDL) methods to them, in order to avoid disclosure of sensitive information on any identifiable data subject. SDL methods often consist of masking or synthesizing the original data records in such a way as to minimize the risk of disclosure of the sensi...
متن کاملPeGS: Perturbed Gibbs Samplers that Generate Privacy-Compliant Synthetic Data
This paper proposes a categorical data synthesizer algorithm that guarantees a quantifiable disclosure risk. Our algorithm, named Perturbed Gibbs Sampler (PeGS), can handle highdimensional categorical data that are intractable if represented as contingency tables. PeGS involves three intuitive steps: 1) disintegration, 2) noise injection, and 3) synthesis. We first disintegrate the original dat...
متن کاملPreserving Edits When Perturbing Microdata for Statistical Disclosure Control Ntalie Shlomo, Ton De Waal
To protect individuals in microdata from the risk of re-identification, a general perturbative method called PRAM (the Post-Randomization Method) is sometimes used for masking records. This method adds “noise” to categorical variables by changing values of categories for a small number of records according to a prescribed probability matrix and a stochastic process based on the outcome of a ran...
متن کاملMarginality: a numerical mapping for enhanced treatment of nominal and hierarchical attributes
The purpose of statistical disclosure control (SDC) of microdata, a.k.a. data anonymization or privacy-preserving data mining, is to publish data sets containing the answers of individual respondents in such a way that the respondents corresponding to the released records cannot be re-identified and the released data are analytically useful. SDC methods are either based on masking the original ...
متن کامل